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Abstract 

Homology detection is critical to genomics. Identifying homologous 
sequence allows us to transfer information gathered in one organism to 
another quickly and with a high degree of confidence. Non-coding RNA 
(ncRNA) presents a challenge for homology detection, as the primary 
sequence is often poorly conserved and de novo structure prediction re- 
mains difficult. This chapter introduces methods developed by the Rfam 
database for identifying "families" of homologous ncRNAs from single 
"seed" sequences using manually curated alignments to build powerful 
statistical models known as covariance models (CMs). We provide a brief 
overview of the state of alignment and secondary structure prediction 
algorithms. This is followed by a step-by-step iterative protocol for iden- 
tifying homologs, then constructing an alignment and corresponding CM. 
We also work through an example, building an alignment and CM for the 
bacterial small RNA MicA, discovering a previously unreported family 
of divergent MicA homologs in Xenorhabdus in the process. This chap- 
ter will provide readers with the background necessary to begin defining 
their own ncRNA families suitable for use in comparative, functional, and 
evolutionary studies of structured RNA elements. 

1 Introduction 

Alignment is a central problem in bioinformatics. A wide range of critical ap- 
plications in genomics rely on our ability to produce "good" alignments. Single- 
sequence homology search as implemented in tools such as BLAST 1] is an (often 
heuristic) application of alignment. The sensitivity and specificity of homology 
search can be improved by the use of evolutionary information in the form of 
accurate substitution and insertion-deletion (indel) rates derived from multi- 
ple sequence alignments (MS As), captured in the statistical models used by 
HMMERJ5] and Infernal [5] for protein and RNA alignments respectively. These 
models can be interpreted as defining "families" of homologous sequences, as in 
the Pfam and Rfam databases[H[5]. By using these models to classify sequences, 
we can infer functional and structural properties of uncharacterized sequences. 

Unfortunately, producing the high-quality "seed" alignments of RNA these 
methods require remains difficult. While proteins can be aligned accurately 
using only primary sequence information with pairwise sequence identities as 



low as 20% for an average-length sequence^ [7], it appears that the "twilight 
zone" where blatantly erroneous alignments occur between RNA sequences may 
begin at above 60% identity [8]. The inclusion of secondary structure informa- 
tion can improve alignment accuracy [9], but predicting secondary structure is 
not trivial[TU]. An instructive example of the difficulties this can lead to is 
the case of the 6S gene, a bacterial RNA which modulates a 70 activity dur- 
ing the shift from exponential to stationary growth. The Escherichia coli 6S 
sequence was determined in 1971 [TT] and its function determined in 2000 [12j. 
However, the extent of this gene's phylogenic distribution was not realized un- 
til 2005 when Barrick and colleagues carefully constructed an alignment from 
a number of deeply diverged putative 6S sequences, and through successive 
secondary-structure aware homology searches demonstrated its presence across 
large swaths of the bacterial phylogeny[T3]- Even now, new homologs are discov- 
ered on a regular basis [Ht 115]. and 6S appears to be an ancient and important 
component of the bacterial regulatory machinery. 

It is our hope to make these techniques accessible to sequence analysis 
novices. This chapter aims to introduce the techniques necessary to construct 
a high-quality RNA alignment from a single seed sequence, and then use the 
information contained in this alignment to identify additional more distant ho- 
mologs, expanding the alignment in an iterative fashion. These methods, while 
time-consuming, can be far more sensitive than a BLAST search [TB"]. We will 
briefly review the state of the art in RNA sequence alignment and structure 
prediction. We then present a brief protocol which starts with a single se- 
quence, and then uses a collection of web and command-line based tools for 
alignment, structure prediction, and search to construct an Infernal covariancc 
model (CM), a probabilistic model which captures many important features of 
structured RNA sequence variation[5]- These models may then be used in the 
iterative expansion of alignments or for homology search and genome annota- 
tion. CMs are also are used by the Rfam database in defining RNA sequence 
families, and are the subject of a dedicated RNA families track at the journal 
RNA Bioloay^T7\. Finally, we present as an instructive example the construc- 
tion of an RNA family for the enterobacterial small RNA MicA, discovering a 
convincing divergent clade of homologs in the process. 

1.1 RNA Alignment and Secondary Structure Prediction 

RNA sequence alignment remains a challenge despite at least 25 years of work 
on the problem. As discussed above, alignments based on primary sequence 
become highly untrustworthy below 60% pair-wise sequence identity, likely 
due to the lower information content of individual nucleic acids as compared 
to amino acids in protein alignments. This can be intuitively understood by 
recalling the fact that 3 nucleic acids are required to encode an individual amino 
acid; so, an amino acid carries 3 times as much information as a nucleic acid 
(a bit less, actually, due to the redundancy of the genetic code). In addition, 
the larger alphabet size of protein sequences allows for the easy deployment of 
more complex substitution models, and a glut of protein sequence data allows 
for highly effective parameterization of these models. 

The incorporation of secondary structure, i.e. base-pairing, information has 
been proposed as a means to make up for these difficulties in RNA alignment 
methods. The first proposal for such a method is now known as the Sankoff 
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algorithm[IS]. The Sankoff algorithm uses dynamic programming, an optimiza- 
tion technique long central to to sequence analysis^] Dynamic programming had 
previously been applied to the problems of sequence alignment 22J and RNA 
folding|23 . Sankoff proposed a union of these two methods. Unfortunately 
the resulting algorithm has a time requirements of 0(L 3N ) and space require- 
ments of 0(L 2N ) where L is the sequence length and N is the number of se- 
quences aligned. This is impractical, even for small numbers of short sequences. 
A number of faster algorithms have been developed to approximate Sankoff 
alignment. Recent examples include Centroid Align [24] . mLocARN A [25j . and 
FoldalignM[26 . These methods can push the RNA alignment twilight zone as 
low as 40 percent identity [5]. 

However, for the purpose of family-building, we are often starting with a 
single sequence of unknown secondary structure, and have to gather additional 
homologs using a fast alignment tool, such as BLAST. Such methods are not able 
to reliably detect homologs below 60 percent sequence identity. In this range 
of pair-wise sequence identities, the slight increases in accuracy of Sankoff-type 
algorithms over non-structural alignment is only rarely worth the additional 
computational costs involved^ Alignments generated with standard alignment 
tools can then be used as a basis for predictions of secondary structure using 
tools like Pfold[28], RNAalifold[29], or CentroidFold[30]. 

Regardless, all modern alignment tools, Sankoff-type or standard, suffer from 
a number of known problems. Most alignment tools use progressive alignment. 
This means that the aligner decomposes the alignment problem in to a series 
of pair-wise alignment problems along a guide tree. This greatly reduces the 
computational complexity of the alignment problem, but means that any error 
in an early pair- wise alignment step is propagated through the entire alignment. 
A number of solutions have been proposed to this problem, such as explicitly 
modeling insertion-deletion histories|31j or using modified or alternative opti- 
mization methods such as consistency-guided progressive alignment [32. or se- 
quence annealing [33]. A second issue is that it is not clear which function of 
the alignment aligners should be optimizing, and many appear to over-predict 
homology 34, 2"7 ] I35j ■ Finally, many parameters commonly used in alignment, 
such as gap opening and closing probabilities and substitution matrices, appear 
to vary across organisms, sequences, and even positions within an alignment. 
All of this leads to considerable uncertainty in alignment [35], which is not eas- 
ily captured by most current alignment methods. The additional parameters 
introduced by RNA secondary structure prediction only compounds these these 
problems. 

A final problem with alignment is the issue of determining whether two se- 
quences are similar due to homology or analogy. Homology describes a similarity 
in features based on common descent; for instance, all bird wings are homolo- 
gous wings. Analogy, on the other hand, describes a similarity in features based 
on common function without common descent; bat and bird wings perform the 

1 A full explanation of dynamic programming is beyond the scope of this book chapter, 
but for a brief introduction see two excellent primers by Sean Eddy covering applications to 
alignment 19, and secondary structure prediction|20|; for those seeking a deeper understanding 
Durbin et al. |21| provides coverage of dynamic programming as well as covariance models. 

2 For recent benchmarks of alignment tools on ncRNA sequences see Hamada et al. |24] and 
the supplementary information of Bradley et qj. |27l ; Hamada includes comparisons of aligner 
runtimes, while Bradley examines relative performance over a range of pair-wise sequence 
percent identities. 
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same function, and appear superficially similar. However, their evolutionary 
histories are quite different. In sequence analysis, we often assume that aligned 
residues within an alignment share common ancestors, but this assumption can 
be confounded by analogous sequence. These analogs often take the form of mo- 
tifs, short sequences which perform specific functions within the RNA molecule 
and can arise easily through convergent evolution. An example of such a motif 
is the bacterial rho-independent terminator [37], a short hairpin responsible for 
halting transcription in many species. While such motifs can be a boon in dis- 
covering novel ncRNA genes [38] or aligning homologs which contain them, they 
can also be a source of false-positives when attempting to build an alignment of 
homologous sequences. 

Rfam has developed a pipeline designed to address many of these problems 39J . 
Starting from a single sequence, we iteratively expand an alignment using In- 
fernal covariance models. During each iteration, we use a variety of automatic 
alignment and secondary structure prediction tools together with manual cura- 
tion and editing in an effort to avoid many of the issues raised above. While the 
Rfam pipeline is designed to run on a high-end computational cluster, we have 
adapted the process here to make it accessible to anyone with a commodity PC 
and an internet connection. 

2 Materials 

2.1 Single Sequence Search 

We rely on NCBI BLAST [I] to quickly identify close homologs of RNA sequences 
in this protocol. NCBI and EMBL-EBI both maintain servers [40] |4T] with 
slightly different interfaces, though there are no substantive differences in the 
implementations. We use the NCBI server here. EBI also maintains servers for 
a number of BLAST and FASTA derivatives, which may be helpful. Both sites 
also allow users to BLAST against databases of expressed sequences including 
GEO at NCBI, and high throughput cDNA and transcriptome shotgun assembly 
databases at EMBL-EBI. Such searches can be helpful for gathering comparative 
expression data for your ncRNA. 

A nucleotide version of the HMMER3 package [2] for sequence search is cur- 
rently in development which promises both increased sensitivity and specificity 
over BLAST at little additional computational cost. We hope that a web server 
similar to the one currently available for protein sequences [42] will be forthcom- 
ing. If it is possible that homologous sequences are spliced (e.g. introns in the 
U3 snoRNA[43 ), then a splice-site aware search method may be useful, such as 
BLAT[33] or Genome Wise [35], but there are not publicly available webservers 
for them that we are aware of. 
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http: / /blast. ncbi.nlm.nih.gov/Blast.cgi 


http://www.ebi.ac.uk/Tools/sss/ncbiblast/ 


http://www.ebi.ac.uk/Tools/sss/ 


http: / /hmmer .janelia.org/search 
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2.2 Alignment and Secondary Structure Prediction Tools 



We find it best to run a variety of alignment and secondary structure predic- 
tion tools simultaneously. Each has its own peculiarities, and our hope is that 
by looking for shared homology and secondary structure predictions we can 
mitigate some of the problems discussed in the introduction. In this proto- 
col, we use the WAR webserver [IS] which allows the user to run 14 different 
methods simultaneously. These include Sankoff-type methods: FoldalignM 26J, 
LocARNA[25], MXSCARNAgZj, MurletjUj, and StrALg9] + PETcofold[50]; 
Align-then-fold methods, which use a traditional alignment tool (ClustalW|511 
[5J] or MAFFT[531[S1]) followed by structure prediction (RNAalifold[2Hl [55] or 
Pfold[28 ); Fold-then-align methods, which predict structures in all the input se- 
quences and attempt to align these structures fRNAcast[5B] + RNAforester|57|'l: 
Sampling methods which attempt to iteratively refine alignment and structure: 
MASTR[58] and RNASampler[59| : and other methods which do not fit in to the 
above traditional categories: CMfinder[5D] and LaRA[CT], Finally, WAR also 
computes a consensus alignment using the alignments produced by all user- 
selected methods as input to the T-Coffee consistency-based aligner 32j . 

However, WAR is by no means exhaustive, and the applications may not 
be the most recent versions available. A number of groups maintain their 
own servers for RNA sequence analysis. Notable servers include the Vienna 
RNA WebServers P]. the Freiburg RNA Tools[53], the CBRC Functional RNA 
Project [64 , and the Center for Non-Coding RNA in Technology and Health 
(RTH) Resources page. In addition, EMBL-EBI maintains a number of web- 
servers for popular multiple sequence alignment alignment tools. Ultimately, as 
you become more comfortable with RNA sequence analysis you may want to 
begin installing and running new tools on a local *NIX machine; however, this 
is beyond the scope of the current chapter. 
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2.3 Genome Browsers 

Genome browsers are essential for checking the context of putative homologs. 
The ENAJJT] provides a no- frills sequence browser perfect for quickly check- 
ing annotations. For deeper annotations, the UCSC genome broswerj65] and 
Ensembl 6(5] both contain a wide range of information for the organisms they 
cover. For bacterial and archaeal genomes, the Lowe lab maintains a modified 
version of the UCSC genome browser [67 which provides a number of tracks of 
particular interest to those working with ncRNA. The CBRC Functional RNA 

3 Currently amino acid only 
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Project maintains a UCSC genome browser mirror[B3] for a number of eukary- 
otic organisms with a larger number of ncRNA-related tracks. 
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http://www.ebi.ac.uk/ena/ 



http: / / genome.ucsc.edu/ 



http: / /www . ensem bl. org| 
http: / / microbes.ucsc.edu/ 



http:/ /www. ncrna.org/glocal/cgi-bin/hgGateway 



2.4 Alignment Editors 

It is possible to edit alignments in any text editor; however we highly recom- 
mend using a secondary structure-aware editor such as Emacs with the RALEE 
major modejBS]. RALEE allows you to color bases according to base identity, 
secondary structure, or base conservation. It also allows the easy manipulation 
of sequences which are involved in structural interactions but are not close in 
sequence space through the use of split screens. A number of other special- 
ized RNA editors are available: Boulder ALE[69] and S2S[70] both allow the 
end user to visualize and manipulate tertiary structure in addition to secondary 
structure, and may be particularly useful if crystallographic information is avail- 
able for your RNA. Other alternatives for editing RNA secondary structure are 
SARSE[7l] and MultiSeq 721 ■ Recent versions of J al View [73] have begun to sup- 
port RNA secondary structure as well, though this functionality isn't completely 
mature at the time of writing (late 2011.) 



Resource 


Reference 


URL 


RALEE 

BoulderALE 

S2S 

SARSE 

MultiSeq 

JalView 


EH] 

[69] 

™. 

HQ 

m 
m 


http://personalpages.manchester.ac.uk/staff/sam.grifhths-jones/software/ralee 


http: / /www.microbio.me/boulderale 


http: / /bioinformatics.org/S2S / 


http: / /sarse. ku.dk/ 


http://www.ks.uiuc.edu/Research /vmd / plugins / multiseq/ 


http: / /www .jalview.org 



2.5 Infernal 

The centerpiece of our protocol is the Infernal package for constructing co- 
variance models(CMs) from RNA multiple alignments [3]. We will use this to 
construct models of our RNA family. CMs model the conservation of positions 
in an alignment similar to a hidden Markov model (HMM), while also capturing 
covariation in structured regions [7H 1751 |2"T] . Covariation is the process whereby 
a mutation of a single base in a hairpin structure will lead to selection in sub- 
sequent generations for compensatory mutations of its structural partner in 
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order to preserve canonical base-pairing, ie: Watson-Crick plus G-U pairs, and 
a functional structure. This combination of structural-evolutionary informa- 
tion has been shown to provide the most sensitive and specific homology search 
for RNA of any tools currently available [9j [76]. Unfortunately, this sensitivity 
and specificity come at a high computational cost, and Infernal searches can 
be time-consuming with genome-scale searches often taking hours on desktop 
computers. The development of heuristics to reduce this computational cost is 
an area of active research for the Infernal team, and has already been mitigated 
to some extent by the use of HMM filters and query-dependent banding of align- 
ment matrices[77]. We refer the reader to Eric Nawrocki's excellent primer on 
annotating functional RNAs in genomic sequence for a friendly introduction to 
the mechanics of the Infernal package 78J. 
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EH [75] 


http:/ /infernal.janclia.org/ 



3 Methods 

We assume for the sake of this protocol that you are starting with a single se- 
quence of interest. If you already have a set of putative homologs, you may wish 
to further diversify your collection of sequences using the methods described in 
section 3.1, or you may skip directly to section 3.2, or 3.4 if a secondary structure 
is known. No matter how many sequences you are starting with, it is always a 
good idea to run the tools available on the Rfam website (rfam.sanger.ac.uk) on 
them. This will verify that there isn't already a CM available that covers your 
sequences. There are a number of other specialist databases that may also be 
worth searching if you have reason to believe your RNA sequence is a member 
of a well-defined class of RNAs, i.e. microRNAs, snoRNAs, rRNAs, tRNAs, etc. 
We have previously reviewed these databases in another book chapter [75]. A 
generic RNA sequence database aiming to capture all known RNA sequences, 
RNAcentral[80 is currently in development and will provide a resource for easily 
identifying similar sequences with some evidence of transcription. 

3.1 Gathering an initial set of homologous sequence 

Now that you've confirmed that your sequence is novel, we will use NCBI- 
BLAST to identify additional homologous sequences. Once you've navigated to 
the nucleotide BLAST server there are a number of important options to set. 

3.1.1 Setting NCBI-BLAST Parameters 

First, it is important to choose a search set appropriate to your sequence. At 
this initial phase, we want to limit our exposure to sequences which are very 
distant from ours to avoid the number of obviously spurious alignments we will 
need to examine, increasing the power of our search. So, if your initial sequence 
is of human origin, you may want to limit your search to Mammalia, Tetrapoda, 
or Vertebrata depending on sequence conservation. Similarly, if you are working 
with an Escherichia coli sequence, you may want to limit your initial searches 
to Enterobacteriaceae or the Gammaproteobacteria. NCBI-BLAST searches are 
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relatively fast, so try several search sets to get a feel for how conserved your 
sequence is. 

The second set of options to set is the "Program Selection" and the "Algo- 
rithm Parameters" . We recommend blastn as it allows for smaller word sizes. 
The word size describes the minimum length of an initial perfect match needed 
to trigger an alignment between our query sequence and a target. Smaller word 
sizes provide greater sensitivity, and seem to perform better for non-coding 
RNAs. We recommend a word size of 7, the smallest the NCBI-BLAST server 
allows. 

Finally, you should set "Max Target Sequences" parameter to at least 1000. 
NCBI-BLAST returns hits in a ranked list from best match to worst by E-value 
(or the number of matches with the same quality expected to be found in a 
search over a database of this size), and will only display as many as "Max 
Target Sequences" is set to. We are primarily interested in matches on the edge 
of what NCBI-BLAST is capable of detecting reliably, and these will naturally 
fall towards the end of this list. 

3.1.2 Selecting Sequences 

Our goal at this stage is to pick a representative set of homologous sequences 
to "seed" our alignment with. As discussed in the introduction, single sequence 
alignment for nucleotides is generally only reliable to approximately 60 per- 
cent pair-wise sequence identity. At the same time, picking a large number 
of sequences with high percent identity can lead to overfitting of the secondary 
structure; that is, if our sequences are too similar we can end up predicting align- 
ments and secondary structures which capture accidental features of a narrow 
clade, rather than the biologically relevant structure and sequence variation. 

There are 3 major criteria we pick additional sequences based on, in rough 
order of importance: percent sequence identity, taxonomy, and sequence cov- 
erage. Handily, the NCBI-BLAST output displays measures of all of these. 
Our first selection criterion, percent identity, should fall between 65% and 95%; 
much lower and the sequence will be difficult to align, higher and it will be too 
similar to have any meaningful variation. 

The second selection criterion, taxonomy, will depend somewhat on the or- 
ganisms your sequence is associated with, but we generally want to limit the 
inclusion to a single (orthologous) instance per species. The exception to this 
rule is for diverged paralogous sequences within the species; if paralogs exist, 
you will need to decide how broadly you wish to define your family. Addition- 
ally, it may be useful to further limit the maximum percent identity to, say, 
90% within a genus to further limit the number of highly similar sequences in 
your initial alignment. 

Finally, assuming that you are sure of your sequence boundaries, we want 
to select sequences that cover the entire starting sequence. If you see many 
matches covering only a short section of your sequence, this may be due to 
the matching of a short convergent motif. This most commonly happens with 
the relatively long, highly-constrained bacterial rho-independent terminators, 
but may occur with other motifs. Alternatively, if you do not have well-defined 
sequence boundaries, you will need to determine these from the conservation you 
see in your BLAST hits - look for taxonomically diverse hits covering the same 
segment of your query sequence. In some cases, such as the long non-coding 
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RNAs, conserved domains may be much shorter than the complete transcribed 
sequence, but stay aware of the potential motif issue. A taxonomic distribution 
of sequences that makes biological sense given your knowledge of the molecule's 
function and that can be explained by direct inheritance of the sequence will be 
your best guide. 

3.1.3 Examining Your Initial Homolog Set 

Once you have assembled a set of sequences fitting the criteria described above, 
it is worth taking a closer look at them. Remember that these sequences will 
form the core of your alignment and CM, and errors at this stage can dramat- 
ically bias your results. A good first test is to examine the taxonomy of your 
sequences, and make sure it makes sense. Can you identify a clear pattern of 
inheritance that might explain the taxonomic distribution you see at this stage? 
Another good check is to examine your sequences in the ENA browser, or a 
domain-specific browser if one exists for your organisms. For many indepen- 
dently transcribed RNAs, genomic context is more conserved than sequence, 
and ncRNA genes will often fall in homologous intergenic or intronic regions 
even at large evolutionary distances. If you are particularly ambitious, and 
the tools are available for your organisms of interest, you may wish to try to 
identify promoter sequence upstream of your candidate or terminator sequence 
downstream. If your sequence is a putative cis-regulatory element, such as a 
riboswitch, thermosensor, or attenuator, you may want to check that it occurs 
upstream of genes with similar functions or in similar pathways. Finally, it is 
always worth searching your putative homologs through the Rfam website even 
if your initial sequence had no matches - Rfam's models are not perfect, and 
may miss distant homologs of known families. 

3.2 Aligning and predicting secondary structure 

We will use the WAR servers to construct an initial alignment. Because of 
the criteria we've set for sequence similarity in our gathering step, all of the 
sequences in our initial homolog set should have at least 60% pairwise se- 
quence identity with at least one other sequence in the set. Under these con- 
ditions sequence-only alignment methods using primary sequence information 
only can preform adequately, as discussed previously These methods combined 
with alignment folding tools which identify for conserved structural signals and 
covariation can produce reasonable predicted secondary structures [TD]. How- 
ever it is still often useful to observe the behavior of as many alignment tools 
as possible. Using WAR, for a fairly fast alignment we recommend running 
CMfinder[60], StrAL+PETfold[49l EU], ClustalW[5U H2] and MAFFT[53l [54] 
with RNAalifold[29l 155] and Pfold[2H]- WAR will also produce a consensus 
alignment using T-Coffec[32 , which will attempt to find an alignment consis- 
tent with all of the individual alignments produced by other methods. 

Once WAR returns your alignment results, there are a number of things 
you should take a note of that will assist you in picking an alignment and 
further in manual refinement. First, the consensus alignment page will display 
a graphical representation of the consistency of the alignments which will allow 
you to quickly tell which areas of the alignment may require attention during 
manual refinement, or areas that may harbor structure not captured by the 
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T-coffea consensus alignment 

IaDAVG^H 

286415 GAAAGACGCGCAUUUGUUAUCA UCAUCCCUGUUAUCAGAGAUGOUAAUU DGGCC 

P002272 GAAAGACGCGCAUUUGUUAUCAUC AUCCOTGACUUCAGAGAUGAAAUGUUaGGCC 

P002433 GAAAGACGCGCAUUUGUUAUCA UCAUCCCUGACAACAGAGAUGUUAAUUC GGCC 

P002910 GAAAGACGCGCAUUUAUUAUCAUCA UCAUCCCUGA AUCAGAGAUGAAAGUUU GGCC 

P2 3 6 S 42 GAAAGACGCGnADOUGUnAUCAUCAUCOCAnCCCnGACAACAGAGAOGOUAAUUDAGGCC 

2312003 GAAAGACGCGCAUUUGUUAUCA UCAUCCCUGUUUUCAGCGAUGAAAUUU UGGCC 

Q00 96_2 GAAAGACGCGCAUUUGUUAUCA UCAUCCCUGAAUUCAGAGAUGAAAUUU OGGCC 

CONS STRUCTURE . 

P002272 ACAGU GAUGUGGCCUUUUU 

P002433 ACAGU GAUGUGGCCU-UUU 

P002 910 ACAGU GAUGUGGCCUUUUU 

P236642 ACAGU GACGUGGCCUUUUU 

3312003 ACUCCGUGAGUGGCCU U U UU 

000 9 6_2 ACUC ACGAGUGGCCUUUUU 
CONSSTRUCTURE^^^f . ^^^^^J^^T 

Figure 1: T-coffee consensus alignment for close MicA homologs produced by WAR, colored 
for alignment consistency between methods. Due to the high percent identity in these se- 
quence, the alignments are highly consistent, though even here the areas of lower consistency 
are informative for manual refinement - see section 4. 



majority consensus. The consensus can be recomputed based on differing subsets 
alignment methods, if you believe one method (or set of methods) may be unduly 
influencing the consensus. Once you've carefully looked over the consensus 
alignment, examine each alignment produced by WAR in turn: What structures 
are shared? Where do the alignments differ from each other? Can you identify 
any sequence or structural motifs which may help to guide your alignment? At 
this level of sequence identity, you should hope to see fairly consistent alignments 
in functional regions of the sequence, interspersed with more difficult to align 
regions, presumably under less severe selective pressure. Often the consensus 
alignment is a good choice to move forward with. However, there are cases where 
certain classes of tools will obviously mis-align regions of the sequence and bias 
the consensus. Keep in mind what you've seen in the alternative alignments as 
well; this information may be useful in manual refinement. You will want to 
save the Stockholm file for the alignment you've chosen to your local computer 
at this point. 

Later in the family-building process when you have identified more distant 
homologs, the average pair-wise identity of the sequences in your data set may 
have dropped below 60%. At this point, you may want to begin including some 
of the Sankoff-typc alignment methods available in WAR. Using these methods 
can dramatically increase the runtime for your sequence alignment jobs, though, 
particularly for sequences over a couple of hundred of bases long. We will 
discuss alternatives to re-aligning sequences during the iterative expansion of 
the alignment in section 3.5. 

3.3 Manually refining alignments 

Our goal in manual refinement is to attempt to correct errors made by auto- 
matic alignment tools. We generally use RALEE[68]. an RNA editing mode for 
Emacs, for editing alignments. However, any editor you are comfortable with in 
which you can easily visualize sequence and structural conservation will work; 
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a number of alternative editors are listed in the Materials section. 

A good place to start editing is around the edges of predicted hairpin struc- 
tures. Are there base-pairs which appear to be misaligned? Can you add base- 
pairs to the structure? Are there predicted base-pairs which don't appear to be 
well conserved that should be trimmed? Can individual bases be moved in the 
alignment to create more convincing support for the predicted structure? 

Once you are satisfied with your manual refinement of predicted secondary 
structure elements, next you should turn your attention to areas identified as 
uncertain in the WAR/T-Coffee consensus alignment. Were there alternative 
structures predicted in these regions? Do you see support for these structures in 
the sequences? If these regions are unstructured, can you identify any conserved 
sequence motifs in the region? If you will be regularly working with a particular 
class of ncRNA, it can be useful to familiarize yourself with predicted binding 
motifs of associated RNA-binding proteins as these are likely to be conserved 
but can have many variable positions. 

At this stage, it is also possible to include information from experimental 
data. Crystal structure information from a single sequence in the SEED align- 
ment can be used to validate and improve a predicted secondary structure. 
Tertiary structure- aware editors such as Boulder Ale [69] can help in applying 
this information to the alignment. Other experimental evidence, such as chem- 
ical footprinting can also provide valuable information. Knowing whether even 
a single base is involved in a pairing interaction can drastically reduce the space 
of possible structures the sequence can fold in to, simplifying the problem of 
predicting secondary structure. Both the RNAfold and RNAalifold web servers 
available through the Vienna RNA website [62 are capable of taking advan- 
tage of this information in the form of folding constraints. We hope that these 
sorts of datasets will become widely available in consistent formats in the near 
future [H]. 

3.4 Building a covariance model 

For those comfortable with the *NIX command line, building an Infernal CM is 
fairly straight-forward. We refer the reader to the User's Guide available from 
the Infernal website (http://infernal.janelia.org) for installation instructions and 
a detailed tutorial. The basic syntax to build and calibrate a family is: 

> cmbuild my. cm my.sto 

> cmcalibrate my. cm 

The first command constructs the CM (my . cm) from the alignment you've 
carefully curated (my.sto). The second command calibrates the various filters 
Infernal uses to accelerate its search using simulated sequences generated from 
the CM. Note that calibration can take a long time - hours for longer mod- 
els. You can get a quick estimate of the time calibration will take using the 
command: 

> cmcalibrate — forecast 1 my. cm 

Congratulations! You should now have a working CM for your RNA family. 
This is a fully capable model, and can be used as is for homology search and 
genome annotation. However, as it stands, your CM will only capture the 
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sequence diversity which was able to be detected by our initial BLAST search. 
In order to fully take advantage of the power of CMs, it is necessary to expand 
the diversity of the sequence it is trained on through iterative expansion of our 
initial set of sequence homologs. 

3.5 Strategies for expanding model coverage 

3.5.1 Plan A: Iterative search of sequence databases 

The method Rfam uses to identify more divergent homologs to seed sequences 
is to pre-filter CM-based searches with sequence-based homology search tools. 
This allows us to cover a large sequence space with a (comparatively) modest 
investment of computational time. Any of the single sequence search tools 
mentioned in section 2.1 would make an effective pre-filter. 

The easiest way to preform filtering yourself is to use the NCBI BLAST 
webserver to search each sequence in your seed alignment following the methods 
outlined for collecting your initial set of homologs in section 3.1. You may wish 
to relax the criteria slightly then use the CM to preform a more sensitive search 
on this set of filtered sequences. This will enable you to detect more distantly 
related sequences, though you should always examine sequence context and the 
phylogenetic relationship between sequences as a sanity check before including 
them in your seed. These methods can be automated with basic scripting and 
bioinformatics modules such as BioPerl[52"] or Biopython(83j, though this is 
beyond the scope of this chapter. 

Once you have identified a new set of homologs, you can align them to your 
previous CM using Infcral's cmalign: 

> cmalign my. cm newsequences . f asta > newsequences . sto 

This alignment can then be merged with your original alignment: 

> cmalign — merge my. cm my. sto newsequences . sto > combined_alignment . sto 

This alignment can then be used to build a new CM, which will capture the 
additional sequence variation you have discovered in your BLAST searches. 

The disadvantage of this method is that each search only uses the information 
available in a single sequence, meaning that valuable information about variation 
is lost and as a result the power of the search suffers. Fast profile-based methods 
such as HMMER30 will hopefully remedy this problem in the near future, but 
these methods are not mature for DNA and RNA sequence at the present. Older 
versions of HMMER can be used to search DNA sequence with increased power, 
but they require more computational resources than BLAST (though far less 
than Infernal) and need to be used at the command-line. 

3.5.2 Plan B: Directed search of chosen sequences 

Another approach is to run the unfiltered CM over selected genomes or genomic 
regions. While the greater sensitivity and specificity of this method can help 
identify more distant homologs than is possible with BLAST, it has the disad- 
vantage that it requires a much larger investment of computational resources to 
provide an equivalent phylogenetic coverage. This method can be particularly 
powerful in bacterial and archaeal genomes, where small genome size allows us 
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to search a phylogenetically-representative sample of genomes in less than a day 
on a desktop computer. In the case of larger eukaryotic genomes, it may be 
necessary to search a few genomes to determine if homologs of your RNA are 
likely to exist in certain lineages, then extract homologous intergenic regions 
to continue searching. Our rationale here is much the same as in limiting the 
database for our initial BLAST search: by only looking in genomes where we 
have some prior belief that they may contain homologous sequence we reduce 
the noise in our low-scoring hits, meaning that we have to manually examine 
less hits to establish a score threshold for likely homologs. 

Once you have examined candidates following the principles outlined earlier, 
it is easy to incorporate your new sequences using the easel package included 
with Infernal. First, search the genome generating a tabfile: 

> cmsearch — tabfile searchf ile .tab my. cm genome. fasta 

Then use easel to index the genome and extract the hits: 

> esl-sfetch — index genome. fasta 

> esl-sfetch — tabfile genome. fasta searchf ile .tab > hits. fasta 

These sequences can then be aligned and merged as with BLAST hits. Al- 
ternatively, if you discover a divergent lineage, it may be easiest to construct 
a separate alignment for these sequences, then use shared structural and se- 
quence motifs to manually combine the two alignments. Sankoff-type alignment 
method may also be useful for aligning divergent clades. 

3.5.3 Plan C: When A and B fail... 

In some cases, it will be very difficult to identify homologs of a candidate RNA 
across its full phylogenetic range. This can be because of high sequence vari- 
ability, as in the Vault RNAs[5i]. Alternatively, some longer RNAs, such as the 
RNA component of the telomerase ribonuceloprotein, consist of well-conserved 
segments interspersed with long variable regions which can't be easily discovered 
by standard search with naive covariance models. 

A number of computational techniques exist for approaching these difficult 
cases, reviewed by Mosig and colleagues [85] . These methods include fragrep2;86 i , 
which allows the user to search fragmented conserved regions, fragrep3, which 
allows the user to incorporate custom structural motifs with fragmented search, 
and GotohScan|87). which implements a semi-global alignment algorithm that 
will align a query sequence to a (potentially) extended genomic region. 

4 An example: MicA 

We will now illustrate some of the concepts we've discussed using the example 
of MicA, an Hfq-dependent bacterial trans-acting antisense small RNA (sRNA). 
Many bacterial sRNAs are similar in function to eukaryotic microRNAs, pairing 
to target mRNA transcripts through a short antisense-binding region, generally 
targeting the transcript for degradation [88]. MicA is known to target a wide- 
range of outer membrane protein mRNAs using a 5' binding-region in both 
E. coli 89J and S. enterica\9i)\ in response to membrane stress. The current 
covariance model for MicA (accession RF00078) in Rfam (release 10.1) is largely 
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restricted to E. coli, S. enterica, and Y. pestis. Here, as an example, we will 

attempt to improve on this model using the methods we've described in this 

chapter. In the process, we discover previously unreported homologs in the 

nematode symbionts of the Gammaproteobacterial genus Xenorhabdus. 

For our starting point, we are using the MicA sequence from Gisela Storz's 
spreadsheet of known E. coli sRNAs}9T]: 

MicA: G AA AGACGCGCATTTGTTATCATCATCCCTGAATTCAGAG ATGAAATTTTGGCC ACTCACGAGTGGCCTTTTT 

It is a useful exercise to compare the single sequence predicted secondary struc- 
tures for this sequence and the E. coli sequence from the current Rfam SEED 
alignment (see Figure 2). This illustrates that even for nearly identical sequences, 
single sequence structure prediction methods can give divergent results. Other 
important features to notice are that the 3' hairpin shared by the predicted 
structures appears to be a rho-independent terminator, and this could be con- 
firmed with a motif hunting tool|37j and used during manual curation. 




Figure 2: Alternative structures predicted by the RNAfold webserver for single MicA se- 
quences. A) E. coli APEC sequence from the current Rfam seed alignment. B) E. coli sequence 
from Storz's sRNA spreadsheet. C) A likely homolog from Erwinia pyrifoliae. Notice the dif- 
ferences in the secondary structure of the first two examples, despite only differing by two 
extra nucleotides at the gene boundaries. The Erwinia prediction only shares a single stem 
with the E. coli predictions, despite relatively high sequence similarity. 

We now begin by following the guidance in section 3.1 to collect an initial 
set of putative homologs. To obtain an initial set of sequences, we BLAST the 
E. coli MicA sequence over the nucleotide collection database limited to the 
enterobacteria (taxonomy id: 543) using the blastn algorithm. The BLAST 
search returns a number of highly similar E. coli sequences, as well as related 
sequences from the closely related S. enterica. As we move down to less similar 
sequences (as judged by their E-values) we identify progressively more evolu- 
tionarily distant organisms. 

From these sequences, we want to select a group of sequences with a rea- 
sonably diverse taxonomic range and as much sequence diversity as possible, 
while being reasonably confident that they are true homologs. In this case 
we will choose based on maximzing genus diversity, a percent id between 75% 
and 90%, and 100% sequence coverage as we're fairly confident in the MicA 
gene boundaries. For our initial alignment, we have chosen sequences from 
Salmonella typhimurium (EMBL-Bank accession: FQ312003), Klebsiella pneu- 
moniae (CP002910), Enterobacter cloaca (CP002272), Yersinia pestis (AM286415), 
Pantoea sp. At-9b (CP002433), and Erwinia pyrifoliae (FP236842). From a 
quick examination with the EN A browser, it appears that all of these sequences 
fall in a intergenic region between a luxS protein homolog and a gshA protein 
homolog, further increasing our confidence that these are true homologs. From 
our results, we can also see a few promising hits that don't quite meet our crite- 
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showing the low E-value results. We are primarily interested in column 2 for genus and 
species information, column 5 for sequence coverage information, and column 7 for percent 
identity informations. 

ria, such as Dickeya, Xenorhabdus, Photorhabdus and Wigglesworthia. We will 

keep these in mind later to expand our coverage. 

Now that we have a starting set of sequences, we can assemble them in to a 
fasta file: 

MJ00096.2 

GAAAGACGCGCATTTGTTATCATCATCCCTGAATTCAGAGATGAAATTTTGGCCACTCACGAGTGGCCTTTTT 
>FQ312003 

GAAAGACGCGCATTTGTTATCATCATCCCTGTTTTCAGCGATGAAATTTTGGCCACTCCGTGAGTGGCCTTTTT 
>CP002272 

GAAAGACGCGCATTTGTTATCATCATCCCTGACTTCAGAGATGAAATGTTTGGCCACAGTGATGTGGCCTTTTT 
>CP002910 

GAAAGACGCGCATTTATTATCATCATCATCCCTGAATCAGAGATGAAAGTTTGGCCACAGTGATGTGGCCTTTTT 
>AM286415 

GAAAGACGCGCATTTGTTATCATCATCCCTGTTATCAGAGATGTTAATTTGGCCACAGCAATGTGGCCTTTT 
>CP002433 

GAAAGACGCGCATTTGTTATCATCATCCCTGACAACAGAGATGTTAATTCGGCCACAGTGATGTGGCCTTTT 
>FP236842 

GAAAGACGCGTATTTGTTATCATCATCTCATCCCTGACAACAGAGATGTTAATTTAGGCCACAGTGACGTGGCCTTTTT 

We can use this to run WAR, and look at the secondary structures predicted by 
each method. One secondary structure appears to dominates the predictions. 
However, its important to check the other predicted secondary structures - do 
any of them pick up convincing substructures that may have been missed by 
other methods? 

In this case, the consensus alignment (see Figure 1) seems to agree well with 
the majority of alignment and structure prediction methods, and is consistent 
with previous experimental probing[92]. We can improve the alignment manu- 
ally. The first basepair in the first stem in CP002433 can be rescued by shifting 
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Figure 4: Alternative structures predicted by the WAR server based on different alignment 
methods. A) T-Coffee consensus alignment, B) CMfinder, and C) StrAL+PETfold. While 
these structures and alignments share some features, the differences in predicted structure 
illustrate the hazard of relying on a single method, even for a short, well- conserved sequence. 



a few nucleotides, and by pulling apart the alignment between the first and 
second stem we reveal what appears to be a well-conserved AAUUU sequence 
motif that was previously hidden (Figure 5). The RNA chaperone Hfq is known 
to bind to A/U rich sequences, so this motif may have some functional signif- 
icance. The strong conservation of the 5' antisense-binding domain provides 
more confidence that these are in fact homologous RNAs. 
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Figure 5: MicA alignment before(top) and after(bottom) manual alignment in RALEE, 
colored for secondary structure and sequence conservation. 

Now we will follow Plan B to add sequences to our alignment using the 
genomes for the low-scoring BLAST hits we had previously made a note of 
while collecting our initial set of sequences, though you could also choose these 
sequences based on your knowledge of your organisms phylogeny or the sus- 
pected function of your RNA. The genomes we've chosen here are Dickeya 
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zeae (CP001655), Sodalis Glossinidius (AP008232), Xenorhabdus nematophila 
(FN667742) and Wiggglesworthia glosinidia (BA000021). Searching these genomes 
allows us to identify strong hits in D. zeae and S. glossinidius with E- values of 
10~ 12 and 1CP 10 which we can merge in to our alignment using the methods in 
section 3.5.1. You should then manually refine the resulting merged alignment 
with an eye towards maintaining conserved sequence motifs and structure. Al- 
ready at this distance, there have been some apparent small decay in secondary 
structure, as well as an expansion of the sequence contained in the loop region 
of the second stem in D. zeae (Figure 6). 
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Figure 6: MicA alignment including merged sequences from D. zeae and S. glossinidius. 

We observe a number of hits in X. nematophila with E- values in the range of 
10~ 2 . By checking each of these individually in the ENA browser, we can iden- 
tify one that falls in the same genomic context as our previous MicA homologs 
(Figure 7). By using this sequence as the starting point for a BLAST search, 
we are able to identify a number of other divergent Xenorhabdus homologs. As 
these are quite diverged from the E. coli sequence, we first construct an align- 
ment for them using WAR (Figure 8), then attempt to merge our alignments 
manually (Figure 9) using shared structural features as our guide. Interestingly, 
the target-binding region of MicA appears to have suffered a poly-A insertion 
down this lineage, suggesting that there may be changes in the regulon it tar- 
gets. Using this model to search all of the bacterial genomes in EMBL-Bank 
(approximately 6GB of sequence, taking 30 hours on a 2.26 GHz Intel Core 
2 Duo processor) shows that our CM now has high-scoring hits exclusively in 
Enterobacteriales, while covering a broader range than our BLAST searches. 
This search also reveals a number of possible sources of additional diversity: 
Photorhabdus asymbiotica and Edwardsiella ictaluri both have strong hits be- 
low the average score for other Enterobacterial genomes - incorporating them 
may further increase the sensitivity of our model, and is left as an exercise to 
the reader. 
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Figure 7: Context of a marginal X. nematophila hit viewed in the ENA genome browser. 
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tt STOCKHOLM 1,0 



tt=GF flU WAR 

FH545200 GCAAGRCGCG- RAAAUUGUUAUCAUC- CUAUUCUCUAAGR + GAUUUUAUUUUGGCCRCUUURRAGUGGCCAUUUU 

FN667741 G- AAGRCGCGCRAAAUUGUUAUI Al IECC . AUUUUCUAAt ■ fl + GAUUUUAUUUCGGCCRCCACU-GGUGGCCAUUUU 

FN667742 G- AAGRUGCGCRAAAUUGUUAUUAUCCCUAUUCUCUAAGRUACUU-UAUU- CGGCCRC- CUCC- GUGGCCAUUUU 

tt=GC SS_cons ««,,« »,»» (((((((,,,,))!)))) 

it=GC RF GcAAGftCGCGCflAAAUUGUUAUCAUCCCUAUUCUCUAAGflGAYUUUAUUUCGGCCfiCCWCI^RGUGGCCAUUUU 
// 

Figure 8: An alignment of Xenorhabdus home-logs. 
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Figure 9: Divergent Xenorhabdus homologs manually merged with the MicA alignment. 
Notice the variation in both secondary structure and sequence conservation added by these 
sequences. 
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